Avoid inconsistent quantile computation between architectures #80

gduranceau · 2024-06-24T10:11:12Z

What does this PR do?

For particular sketches and quantile values, GetValueAtQuantile can return very different results on different architectures, resulting from slightly different rounding on internal float values used in the computation. This PR exposes the issues in the tests and addresses the problem.

The problem was initially highlighted with a sketch containing 21 positive values, 14 of which being 0, and computing the 0.95 quantile of the sketch. In this case, we get:

rank = quantile * (count - 1) = 0.95 * (20 - 1) = 19

The number of zeroes in the sketch is then subtracted from this rank to find the key at rank 5 (19-14) in the positive-values store of the sketch.

0.95 does not have an exact representation as a float64 (0x1.e666666666666p-01). But when it is multiplied by 20, the rounding on the result lead to the value 19, which is represented in an exact way as float64 (0x1.3p+04). This is the case on both ARM64 and AMD64 systems on which I tested this. In GetValueAtQuantile, I could validate that rank for this particular sketch was computed consistently to this value 19 (0x1.3p+04).

I was surprised to see that when GetValueAtQuantile computes rank - s.zeroCount just after, the value was different on ARM64 and AMD64:

On AMD64, I got the value 5, represented exactly as a float64 (0x1.4p+02).
On ARM64, I got the value 5, but rounded down to 0x1.3ffffffffffffp+02.

This is enough to get a different bin for this rank, resulting in a very different 0.95 quantile for the sketch I was testing with.

Disassembling the code generated for GetValueAtQuantile showed that the rank variable was actually not reused by the compiler to compute rank - s.zeroCount. Instead, the fused multiply-add fnmsub instruction was used to compute quantile * (count - 1) - s.zeroCount directly. This instruction does not perform an intermediate rounding after the multiplication, which explains how the end result can be slightly different.

In GetValueAtQuantile, we want to make sure that the rank computed can be safely used as a reference point when later computing a rank to lookup in the negative-value or positive-value store. This can be achieved by an explicit floating point conversion:

rank := float64(quantile * (count - 1))

As per the go specification:

An implementation may combine multiple floating-point operations into a single fused operation, possibly across statements, and produce a result that differs from the value obtained by executing and rounding the instructions individually. An explicit floating-point type conversion rounds to the precision of the target type, preventing fusion that would discard that rounding.

This problem is not specific to the sketch I was testing with of course. It can happen whenever a sketch contains zero values, the computed quantile doesn't have an exact float64 representation (requiring rounding on the rank computation), and the rank to look for in the positive-value store falls exactly at the boundary between 2 bins.

The PR adds a test which exposes the issue on ARM64 when the explicit float64 conversion added by this PR is removed. It uses sketches built from values following a linear distribution, but with zeroes added once every 2 values. For the right test size (21) and quantile (0.95), the lower and upper quantiles computed on the Dataset are equal (because math.Floor(rank) and math.Ceil(rank) are equal), but the effective rank used by GetValueAtQuantile is rounded down, selecting the bin just below the expected one.

piochelepiotr

It's an interesting change, but I'm wondering why it was an issue?
The error on the value of a specific percentile was still kept within the bounds of relative accuracy right?

gduranceau · 2024-06-24T14:37:22Z

@piochelepiotr

The error on the value of a specific percentile was still kept within the bounds of relative accuracy right?

No. If you run the test I added without the float type conversion of rank, the test will fail, because the wrong bin is selected.

[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Change | Age | Adoption | Passing | Confidence | |---|---|---|---|---|---| | [github.com/DataDog/sketches-go](https://togithub.com/DataDog/sketches-go) | `v1.4.5` -> `v1.4.6` | [![age](https://developer.mend.io/api/mc/badges/age/go/github.com%2fDataDog%2fsketches-go/v1.4.6?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/go/github.com%2fDataDog%2fsketches-go/v1.4.6?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/go/github.com%2fDataDog%2fsketches-go/v1.4.5/v1.4.6?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/go/github.com%2fDataDog%2fsketches-go/v1.4.5/v1.4.6?slim=true)](https://docs.renovatebot.com/merge-confidence/) | --- > [!WARNING] > Some dependencies could not be looked up. Check the Dependency Dashboard for more information. --- ### Release Notes <details> <summary>DataDog/sketches-go (github.com/DataDog/sketches-go)</summary> ### [`v1.4.6`](https://togithub.com/DataDog/sketches-go/releases/tag/v1.4.6) [Compare Source](https://togithub.com/DataDog/sketches-go/compare/v1.4.5...v1.4.6) #### What's Changed - Avoid inconsistent quantile computation between architectures by [@gduranceau](https://togithub.com/gduranceau) in [https://github.com/DataDog/sketches-go/pull/80](https://togithub.com/DataDog/sketches-go/pull/80) #### New Contributors - [@gduranceau](https://togithub.com/gduranceau) made their first contribution in [https://github.com/DataDog/sketches-go/pull/80](https://togithub.com/DataDog/sketches-go/pull/80) **Full Changelog**: DataDog/sketches-go@v1.4.5...v1.4.6 </details> --- ### Configuration 📅 **Schedule**: Branch creation - "on tuesday" (UTC), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/open-telemetry/opentelemetry-collector-contrib).  --------- Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: opentelemetrybot <107717825+opentelemetrybot@users.noreply.github.com>

Avoid inconsistent quantile computation between architectures

d0e372d

mnshdw approved these changes Jun 24, 2024

View reviewed changes

piochelepiotr approved these changes Jun 24, 2024

View reviewed changes

piochelepiotr merged commit f301b7a into DataDog:master Jun 24, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid inconsistent quantile computation between architectures #80

Avoid inconsistent quantile computation between architectures #80

gduranceau commented Jun 24, 2024

piochelepiotr left a comment

gduranceau commented Jun 24, 2024

Avoid inconsistent quantile computation between architectures #80

Avoid inconsistent quantile computation between architectures #80

Conversation

gduranceau commented Jun 24, 2024

What does this PR do?

piochelepiotr left a comment

Choose a reason for hiding this comment

gduranceau commented Jun 24, 2024